Fundamental Techniques in Data Science with R

Packages and functions used

library(magrittr) # pipes
library(dplyr)    # data manipulation
library(ggplot2)  # plotting
library(mice)     # for boys data
library(GGally)   # correlation and distribution matrixplot
library(mctest)   # multicollinearity test
titanic <- read.csv(file = "titanic.csv", header = TRUE) %>%
  mutate(Pclass = factor(Pclass, 
                         levels = c(3, 2, 1), 
                         ordered = FALSE))

Recap

Linear modeling

Assumptions

Linear regression

  • requires a linear relation between the dependent and independent variables (see the diagnostic sketch after this list).
  • requires the error terms (residuals) to be normally distributed.
  • requires homoscedasticity

    • i.e. the standard deviations of the residuals are constant and do not depend on the independent variables
  • requires the dependent variable to be measured on an interval or ratio scale
  • assumes that there is little or no multicollinearity in the data.
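
The first three assumptions can be eyeballed with R's built-in diagnostic plots for an lm object. A minimal sketch, using the simple Anscombe model fitted below:

fit <- anscombe %$% lm(y1 ~ x1)
par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a 2x2 grid
plot(fit)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage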

Modeling

A simple model:

anscombe %$% lm(y1 ~ x1) %>% summary
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Modeling

A more complex model:

anscombe %$% lm(y1 ~ x1 + x2) %>% summary
## 
## Call:
## lm(formula = y1 ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## x2                NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Modeling

The model on the previous slide is perfectly multicollinear.

anscombe %$% identical(x1, x2)
## [1] TRUE

x1 and x2 are identical and there is no unique way to distribute the regression parameters over x1 and x2.

As a result, the model parameters cannot be uniquely estimated, so R’s lm() function will give you the lower-rank approximation with only one of the two parameters. The resulting model outcome is still valid!
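
You can ask R which terms were dropped because of this rank deficiency. A minimal sketch using alias():

fit <- anscombe %$% lm(y1 ~ x1 + x2)
alias(fit) # reports that x2 is a perfect linear combination of x1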

Modeling

A more complex model:

anscombe2 <- anscombe %>%
  mutate(y5 = factor(ifelse(y3 > 7.5, "yes", "no")))
fit <- anscombe2 %$% lm(x1 ~ y1 + y2 * y5)

Parameters

fit %>% summary
## 
## Call:
## lm(formula = x1 ~ y1 + y2 * y5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4769 -0.3182 -0.1109  0.2981  0.7319 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.42965    0.76580   0.561 0.595084    
## y1           0.06465    0.14272   0.453 0.666479    
## y2           0.91014    0.12688   7.173 0.000371 ***
## y5yes       32.21191    5.30427   6.073 0.000905 ***
## y2:y5yes    -3.26437    0.59856  -5.454 0.001582 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.519 on 6 degrees of freedom
## Multiple R-squared:  0.9853, Adjusted R-squared:  0.9755 
## F-statistic: 100.6 on 4 and 6 DF,  p-value: 1.254e-05

Prediction

coef(fit)
## (Intercept)          y1          y2       y5yes    y2:y5yes 
##  0.42964653  0.06465026  0.91014154 32.21191128 -3.26437210
new <- data.frame(y1 = c(5, 5), y2 = c(8, 8), y5 = as.factor(c("yes", "no")))
predict(fit, newdata = new)
##        1        2 
## 14.13096  8.03403
0.42964653 + 0.06465026 * 5 + 0.91014154 * 8 + 32.21191128  + -3.26437210 * 8 # y5 = "yes"
## [1] 14.13096
0.42964653 + 0.06465026 * 5 + 0.91014154 * 8 # y5 = "no"
## [1] 8.03403
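
The same hand calculation can be written compactly as the design matrix multiplied by the coefficient vector. A minimal sketch, reusing the new data frame from above:

X <- model.matrix(~ y1 + y2 * y5, data = new) # expands the factor and the interaction
drop(X %*% coef(fit))                         # matches predict(fit, newdata = new)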

Logistic modeling

Assumptions

Logistic regression

  • does not require a linear relation between the dependent and independent variables.
  • does not require the error terms (residuals) to be normally distributed.
  • does not require homoscedasticity:

    • So the standard deviations of the residuals may depend on the independent variables!
  • does not require the dependent variable to be measured on an interval or ratio scale

Assumptions continued

However, logistic regression

  • requires the outcome to be binary
  • requires the observations to be independent of each other
  • requires there to be little or no multicollinearity among the independent variables.

    • i.e. the independent variables should not be too highly correlated with each other.
  • assumes linearity of independent variables and log odds.
  • typically requires a large sample size:

    • a rule of thumb: at least 10 cases with the least frequent outcome for each independent variable.
    • If \(P(\text{least frequent}) = .2\), then for 4 predictors you need \((10 \times 4) / .2 = 200\) cases (computed below).
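
As a quick sanity check, that rule of thumb is a one-liner in R, with the numbers from the example above:

k <- 4       # number of independent variables
p <- 0.2     # proportion of the least frequent outcome
(10 * k) / p # minimum number of cases: 200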

Modeling

A simple model for survival fitted with glm(). Note that without a family argument, glm() defaults to family = gaussian, so this is still ordinary linear regression rather than logistic regression:

titanic %$% glm(Survived ~ Age)
## 
## Call:  glm(formula = Survived ~ Age)
## 
## Coefficients:
## (Intercept)          Age  
##    0.446210    -0.002058  
## 
## Degrees of Freedom: 886 Total (i.e. Null);  885 Residual
## Null Deviance:       210.1 
## Residual Deviance: 209.4     AIC: 1243

Modeling

A more complex model, again using glm()’s default gaussian family:

titanic %$% glm(Survived ~ Age * Pclass)
## 
## Call:  glm(formula = Survived ~ Age * Pclass)
## 
## Coefficients:
## (Intercept)          Age      Pclass2      Pclass1  Age:Pclass2  
##    0.403619    -0.006323     0.346715     0.570724    -0.002968  
## Age:Pclass1  
##   -0.002564  
## 
## Degrees of Freedom: 886 Total (i.e. Null);  881 Residual
## Null Deviance:       210.1 
## Residual Deviance: 176.9     AIC: 1101

Parameters

fit <- titanic %$% glm(Survived ~ Age, 
                       family = binomial(link="logit"))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0864  -1.0017  -0.9439   1.3562   1.5806  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.209189   0.159494  -1.312   0.1897  
## Age         -0.008774   0.004947  -1.774   0.0761 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1182.8  on 886  degrees of freedom
## Residual deviance: 1179.6  on 885  degrees of freedom
## AIC: 1183.6
## 
## Number of Fisher Scoring iterations: 4

Prediction

new <- data.frame(Age = 40)
predict(fit, newdata = new, type = "response")
##         1 
## 0.3635098
coef(fit)
##  (Intercept)          Age 
## -0.209188757 -0.008774355
logodds <- -0.209188757 + (40 * -0.008774355)
odds <- exp(logodds)
prob <- odds / (1 + odds)
data.frame(logodds = logodds, odds = odds, prob = prob, plogis = plogis(logodds))
##     logodds     odds      prob    plogis
## 1 -0.560163 0.571116 0.3635098 0.3635098

Example 2

fit <- titanic %$% glm(Survived ~ Age * Pclass, 
                       family = binomial(link="logit"))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age * Pclass, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0888  -0.8368  -0.6521   1.0292   2.3197  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.222282   0.246559  -0.902 0.367303    
## Age         -0.038062   0.009838  -3.869 0.000109 ***
## Pclass2      1.306741   0.458158   2.852 0.004342 ** 
## Pclass1      2.365100   0.528674   4.474 7.69e-06 ***
## Age:Pclass2 -0.002199   0.015564  -0.141 0.887623    
## Age:Pclass1 -0.002480   0.014709  -0.169 0.866133    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1182.8  on 886  degrees of freedom
## Residual deviance: 1036.9  on 881  degrees of freedom
## AIC: 1048.9
## 
## Number of Fisher Scoring iterations: 4

Confidence intervals

confint(fit)
## Waiting for profiling to be done...
##                   2.5 %      97.5 %
## (Intercept) -0.70582264  0.26254796
## Age         -0.05784190 -0.01923168
## Pclass2      0.42354482  2.22513745
## Pclass1      1.35339096  3.43128459
## Age:Pclass2 -0.03318941  0.02801718
## Age:Pclass1 -0.03152570  0.02625743
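
Because these intervals are on the log-odds scale, one could exponentiate them to read the parameters as odds ratios. A minimal sketch:

exp(cbind(OR = coef(fit), confint(fit))) # odds ratios with 95% profile CIs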

Predict a 40-year-old in 2nd class

coef(fit)[c(1:3, 5)]
##  (Intercept)          Age      Pclass2  Age:Pclass2 
## -0.222282007 -0.038061512  1.306741345 -0.002199355
new <- data.frame(Age = 40, 
                  Pclass = as.factor(2))
predict(fit, newdata = new, type = "response")
##         1 
## 0.3714561
logodds <- -0.222282007 + (40 * -0.038061512) + 1.306741345 + (40 * -0.002199355)
odds <- exp(logodds)
prob <- odds / (1 + odds)
data.frame(logodds = logodds, odds = odds, prob = prob, plogis = plogis(logodds))
##      logodds      odds      prob    plogis
## 1 -0.5259753 0.5909787 0.3714561 0.3714561

Predict a 23- and a 65-year-old in 3rd class

coef(fit)[1:2]
## (Intercept)         Age 
## -0.22228201 -0.03806151
new <- data.frame(Age = c(23, 65), 
                  Pclass = as.factor(3))
predict(fit, newdata = new, type = "response")
##         1         2 
## 0.2501717 0.0631932
logodds <- -0.22228201 + (c(23, 65) * -0.03806151)
odds <- exp(logodds)
prob <- odds / (1 + odds)
data.frame(logodds = logodds, odds = odds, prob = prob, plogis = plogis(logodds))
##     logodds       odds       prob     plogis
## 1 -1.097697 0.33363866 0.25017170 0.25017170
## 2 -2.696280 0.06745597 0.06319321 0.06319321

Plot logodds
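
The original figure is not reproduced here. A sketch of how such a plot could be made with ggplot2, assuming it showed the fitted log-odds of survival against age per passenger class (all 887 rows were used in the fit, so predict() aligns with the data):

titanic %>%
  mutate(logodds = predict(fit, type = "link")) %>% # linear predictor: the log-odds
  ggplot(aes(x = Age, y = logodds, colour = Pclass)) +
  geom_line()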

Plot probabilities
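
Likewise, a sketch for the fitted probabilities, assuming the figure showed the response-scale predictions:

titanic %>%
  mutate(prob = predict(fit, type = "response")) %>% # predictions on the probability scale
  ggplot(aes(x = Age, y = prob, colour = Pclass)) +
  geom_line()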

Multicollinearity

Visualizing

ggpairs(anscombe, progress = FALSE)

How to inspect

fit <- anscombe %$% lm(y1 ~ x1 + x2)

Overall diagnostics

anscombe %$% omcdiag(cbind(x1, x2), y1)
## 
## Call:
## omcdiag(x = cbind(x1, x2), y = y1)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                         MC Results detection
## Determinant |X'X|:     0.00000e+00         1
## Farrar Chi-Square:             Inf         1
## Red Indicator:         1.00000e+00         1
## Sum of Lambda Inverse:         Inf         1
## Theil's Method:        1.33350e+00         1
## Condition Number:      1.14771e+08         1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

Individual diagnostics

anscombe %$% imcdiag(cbind(x1, x2), y1)
## 
## Call:
## imcdiag(x = cbind(x1, x2), y = y1)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##    VIF TOL  Wi  Fi Leamer CVIF Klein IND1 IND2
## x1 Inf   0 Inf Inf      0 -Inf     1    0    1
## x2 Inf   0 Inf Inf      0 -Inf     1    0    1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## * all coefficients have significant t-ratios
## 
## R-square of y on all x: 0.6665 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Which regressors

anscombe %$% imcdiag(cbind(x1, x2), y1, method = "VIF")
## 
## Call:
## imcdiag(x = cbind(x1, x2), y = y1, method = "VIF")
## 
## 
##  VIF Multicollinearity Diagnostics
## 
##    VIF detection
## x1 Inf         1
## x2 Inf         1
## 
## Multicollinearity may be due to x1 x2 regressors
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## ===================================

Another example

boys %$% omcdiag(cbind(bmi, wgt, hgt), age)
## 
## Call:
## omcdiag(x = cbind(bmi, wgt, hgt), y = age)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                        MC Results detection
## Determinant |X'X|:         0.0172         0
## Farrar Chi-Square:      2941.2661         1
## Red Indicator:             0.7927         1
## Sum of Lambda Inverse:    64.7307         1
## Theil's Method:            0.8489         1
## Condition Number:         54.9661         1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

Individual diagnostics

boys %$% imcdiag(cbind(bmi, wgt, hgt), age)
## 
## Call:
## imcdiag(x = cbind(bmi, wgt, hgt), y = age)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##         VIF    TOL        Wi       Fi Leamer    CVIF Klein  IND1   IND2
## bmi  6.4429 0.1552  1970.329  3946.10 0.3940 -0.2016     0 4e-04 0.9148
## wgt 37.1650 0.0269 13091.715 26219.60 0.1640 -1.1631     1 1e-04 1.0537
## hgt 21.1228 0.0473  7284.462 14589.05 0.2176 -0.6610     0 1e-04 1.0316
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## * all coefficients have significant t-ratios
## 
## R-square of y on all x: 0.9608 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Which regressors

boys %$% imcdiag(cbind(bmi, wgt, hgt), age, method = "VIF")
## 
## Call:
## imcdiag(x = cbind(bmi, wgt, hgt), y = age, method = "VIF")
## 
## 
##  VIF Multicollinearity Diagnostics
## 
##         VIF detection
## bmi  6.4429         0
## wgt 37.1650         1
## hgt 21.1228         1
## 
## Multicollinearity may be due to wgt hgt regressors
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## ===================================

Example without MC

boys %$% omcdiag(cbind(wgt, hgt), age, method = "VIF")
## 
## Call:
## omcdiag(x = cbind(wgt, hgt), y = age, method = "VIF")
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                        MC Results detection
## Determinant |X'X|:         0.1110         0
## Farrar Chi-Square:      1592.8922         1
## Red Indicator:             0.9429         1
## Sum of Lambda Inverse:    18.0249         1
## Theil's Method:            0.8189         1
## Condition Number:         18.9067         0
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

Example without MC

boys %$% imcdiag(cbind(wgt, hgt), age, method = "VIF")
## 
## Call:
## imcdiag(x = cbind(wgt, hgt), y = age, method = "VIF")
## 
## 
##  VIF Multicollinearity Diagnostics
## 
##        VIF detection
## wgt 9.0125         0
## hgt 9.0125         0
## 
## NOTE:  VIF Method Failed to detect multicollinearity
## 
## 
## 0 --> COLLINEARITY is not detected by the test
## 
## ===================================

The actual correlation

boys %>% 
  select(wgt, hgt, age) %>%
  cor(use = "pairwise.complete.obs")
##           wgt       hgt       age
## wgt 1.0000000 0.9428906 0.9505762
## hgt 0.9428906 1.0000000 0.9752568
## age 0.9505762 0.9752568 1.0000000
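
With only two predictors, the VIF follows directly from their correlation as \(\text{VIF} = 1 / (1 - r^2)\). A quick check against the imcdiag() output above:

r <- 0.9428906 # cor(wgt, hgt) from the matrix above
1 / (1 - r^2)  # approximately 9.01, matching the VIF of 9.0125 reported by imcdiag()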

Model fit

Model fit summary

See the previous lectures for detailed inspections of model fit.

In short:

\[\text{A model with the same fit but fewer parameters is the better model.}\]

Techniques:

  • Use the anova() function to compare nested models.
  • Use the AIC() function to compare any models fitted to the same outcome:

    • The AIC is based on the log-likelihood of the model given the data, penalized for the number of estimated parameters (the formula follows below).
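
For reference, the AIC is defined as

\[\text{AIC} = -2\log\hat{L} + 2k,\]

where \(\hat{L}\) is the maximized likelihood of the model and \(k\) the number of estimated parameters; lower values indicate a better fit–complexity trade-off.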

Example

fit1 <- boys %>% select(age, hgt, wgt) %>% na.omit() %$% lm(age ~ hgt)
fit2 <- boys %>% select(age, hgt, wgt) %>% na.omit() %$% lm(age ~ hgt + wgt)
anova(fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: age ~ hgt
## Model 2: age ~ hgt + wgt
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    725 1679.0                                  
## 2    724 1400.6  1    278.46 143.94 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(fit1, fit2)
##      df      AIC
## fit1  3 2677.679
## fit2  4 2547.849

Why the fuss?

fit1 %>% summary() %>% .$coefficients
##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) -9.7562988 0.170397472 -57.25612 3.252757e-271
## hgt          0.1443346 0.001215674 118.72803  0.000000e+00
fit2 %>% summary() %>% .$coefficients
##                Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) -7.45952488 0.246781862 -30.22720 1.871803e-130
## hgt          0.10660192 0.003335514  31.95967 1.759561e-140
## wgt          0.07137607 0.005949200  11.99759  2.242865e-30

In this scenario, the overall interpretation of the parameter hgt did not change. However, multicollinearity may:

  • make choosing the most efficient set of predictors challenging
  • make it harder to determine the precise effect of individual predictors

The overall fit of the model and its predictions are not affected by multicollinearity.
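
A minimal sketch demonstrating this with the perfectly collinear Anscombe model from earlier:

fit_mc <- anscombe %$% lm(y1 ~ x1 + x2) # x2 is aliased with x1 and gets dropped
fit_ok <- anscombe %$% lm(y1 ~ x1)
all.equal(fitted(fit_mc), fitted(fit_ok)) # TRUE: identical predictions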